Maria Bochenek
Consider the following model:
$$f(x_1, x_2) = (x_1 + x_2)^2$$
Assume that $X_1, X_2 \sim \mathit{U}[-1,1]$ and $x_1 = x_2$ (full dependence).
Calculate the $PD$ (Partial Dependence) profile for variable $x_1$ in this model.
Extra task, if you do not fear conditional expected values: calculate the $ME$ (Marginal Effects) and $ALE$ (Accumulated Local Effects) profiles for variable $x_1$ in this model.
Let's note that the marginal probability distributions of $X_1$ and $X_2$ are given by $$\begin{align*} h_{X_1} (x_1) &= \begin{cases} \frac{1}{2}, & \text{for } -1 \leq x_1 \leq 1 \\ 0, & \text{otherwise} \end{cases}\\ h_{X_2} (x_2) &= \begin{cases} \frac{1}{2}, & \text{for } -1 \leq x_2 \leq 1 \\ 0, & \text{otherwise} \end{cases} \end{align*}$$
The joint probability distribution of $X_1$ and $X_2$ must satisfy
$$h_{X_1, X_2} (x_1, x_2) = h_{X_1 | X_2} (x_1|x_2) h_{X_2}(x_2) = h_{X_2 | X_1} (x_2|x_1) h_{X_1}(x_1).$$
Moreover, under full dependence ($x_1 = x_2$), the conditional distribution of $X_2$ given $X_1 = z$ is a point mass at $z$: $$h_{X_2 | X_1=z} (x_2|x_1=z) = \delta(x_2 - z).$$
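For the first task, note that the PD profile averages the model over the *marginal* distribution of $X_2$, deliberately ignoring the dependence $x_1 = x_2$: $$g^{PD}(z) = \mathbb{E}_{X_2}\left[(z + X_2)^2\right] = \int_{-1}^{1} \tfrac{1}{2}(z + x_2)^2\,dx_2 = z^2 + \tfrac{1}{3}.$$ A quick Monte Carlo sanity check of this result (variable names here are ours, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.uniform(-1.0, 1.0, 200_000)  # sample from the marginal of X_2

grid = np.linspace(-1.0, 1.0, 9)
# empirical PD profile: average f(z, X_2) over the marginal of X_2
pd_hat = np.array([np.mean((z + x2) ** 2) for z in grid])
pd_theory = grid ** 2 + 1.0 / 3.0  # analytic answer: z^2 + 1/3

print(np.max(np.abs(pd_hat - pd_theory)))  # Monte Carlo error, close to 0
```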
In this assignment we will again use the QSAR-biodegradation dataset LINK for a binary classification task. The dataset has 17 features and 1055 rows. We changed the class labels from $\{1, 2\}$ to $\{0, 1\}$ for consistency with the procedure in HW1 (this was necessary to use BCELoss while training the neural network classifier). Class 0 has $699$ instances and class 1 has $356$, so the dataset's imbalance ratio equals $1.96$. The task is to distinguish ready (class 1) from not ready (class 0) biodegradable molecules based on molecular structure descriptors.
We selected sklearn.ensemble.GradientBoostingClassifier, one of the better-performing models in the analysis from HW1. To extend the scope of our analysis, we also included sklearn.linear_model.LogisticRegression as an example of a model with worse performance.
We set a random seed for reproducibility and split the dataset into training and test sets (80:20). We performed feature scaling using sklearn.preprocessing.StandardScaler. Model performance is presented in the table below.
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.857820 | 0.925680 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |
We selected two observations from the test set, one from each class. The selected observations are presented below.
| | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261 | 0.570528 | -1.02671 | -0.199308 | 2.14755 | 0.090398 | 0.734119 | 0.567574 | -0.658983 | 0.114451 | -0.294956 | 0.0880715 | -0.00262827 | 0.152555 | 0.565992 | -0.49876 | -0.447733 | -0.311315 | 1 |
| 679 | 1.2897 | -0.922888 | 0.170087 | 2.2257 | -0.573715 | -0.644963 | 0.699339 | -1.06074 | -0.113295 | 0.506875 | 1.38873 | -0.0335233 | -0.751856 | -0.876388 | -0.267852 | -1.32678 | -0.50768 | 0 |
The predictions for the selected observations are presented below.
| | 261 | 679 |
|---|---|---|
| TARGET | 1 | 0 |
| GradientBoostingClassifier | 1 | 1 |
| LogisticRegression | 0 | 0 |
It's important to note that even though the selected observations come from two different classes, both models assign them to the same class: GradientBoostingClassifier predicts that both observations belong to class 1, while LogisticRegression predicts that both belong to class 0.
The previously selected observations have different CP profiles, which is best illustrated by the variables $V1$, $V12$, $V15$ and $V27$.
For most variables, the difference between the model's predictions for the two selected observations is stable, as it is for $V12 < 1.7$. However, for $V12 > 1.7$, even though the predictions increase for both observations, they increase faster for observation $679$ than for $261$; hence for $V12 > 2.2$ the prediction for $679$ exceeds that for $261$, even though we would expect $679$ to have the lower prediction since it belongs to class 0.
Another example of unstable behaviour that distinguishes the CP profiles of the selected observations: for observation $261$, the model's prediction decreases rapidly once $V1$ exceeds 0.3, while for observation $679$ it increases. Additionally, for $0.7 < V15 < 0.9$ the prediction for $261$ decreases while the prediction for $679$ increases; for $V15 > 0.9$ the predictions are stable again, but the absolute difference between them is larger than it was for $V15 < 0.7$. Similarly for $V27$: the predictions for both observations are quite stable when $V27 < 0.6$ and $V27 > 1.5$, but for $0.6 < V27 < 0.7$ the prediction for $261$ drops rapidly (later stabilising near the level the prediction for $679$ used to have), while the prediction for $679$ increases and later stabilises near the value the prediction for $261$ used to have. This behaviour of the predictions might explain why we observe the misclassification of these two observations.
Interestingly, $V1$ and $V12$ were identified as variables of high importance with positive attribution for observation $879$ belonging to class 1, while $V1$ and $V27$ were of high importance with negative attribution for observation $1002$ belonging to class 0.
Above we can see an example of a PDP profile that indicates high importance of a variable (left, $V18$) and a PDP profile that suggests low importance (right, $V17$).
CP is a local explanation, while PDP is a global one. Let's first explore the PDP for $V1$, $V12$, $V15$ and $V27$, the variables with the most different CP profiles for observations $261$ and $679$.
The stability of the predictions for $V15$ in the PDP profile contrasts with the previously analysed CP profiles, which suggested that $V15$ is an important predictive variable. This is a good example of the fact that even though CP profiles may locally suggest a significant effect of a variable on the predictions, this need not agree with the PDP profile for that variable.
PDP profile analysis shows that $V1$ and $V12$ have a larger impact on the predictions than $V15$. Based on the PDP profiles, the impact of $V27$ on the prediction appears smaller than that of $V1$ and $V12$ but larger than that of $V15$. Moreover, $V1$ and $V12$ take values over a wider range than $V15$ and $V27$.
For variables $V18$ and $V17$, the level of importance indicated by the PDP profiles is the same for both models ($V18$ high, $V17$ low), as can be seen below.
When comparing the PDP profiles between the models, the first thing worth noting is that LogisticRegression yields smooth profiles, while GradientBoostingClassifier's profiles are step functions with some variability.
For $V18$, the largest difference between the models occurs when $V18 < 0$, while for $V17$ the difference grows with the variable's value for $V17 > 0$. This is expected, since ensemble tree classifiers usually shrink predictions towards the average and are often poor at extrapolating outside the range of values observed in the training dataset.
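The inability of tree ensembles to extrapolate can be demonstrated on synthetic 1-D data (a sketch, unrelated to the QSAR data; all names are ours): every split of every tree lies inside the training range, so any query point beyond that range falls into the same leaves and receives the same prediction.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# synthetic data: y = 2x + noise, with x confined to [-1, 1]
rng = np.random.default_rng(0)
X_tr = rng.uniform(-1, 1, (200, 1))
y_tr = 2 * X_tr[:, 0] + rng.normal(0, 0.1, 200)

gbm = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
lin = LinearRegression().fit(X_tr, y_tr)

X_out = np.array([[2.0], [5.0]])  # well outside the training range
print(gbm.predict(X_out))  # two identical values: trees cannot extrapolate
print(lin.predict(X_out))  # the linear model keeps growing
```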
The effect of $V1$ and $V15$ on the prediction seems to be underestimated by the GradientBoostingClassifier model. Both models explain the effect of $V27$ on the prediction similarly, which might explain why it was previously identified as important for explaining the predictions of GradientBoostingClassifier (HW3, HW4). The LogisticRegression model does not capture the more complicated relationship between $V12$ and the prediction. In theory, addressing these issues might improve predictive performance, in which case the previously selected observations $261$ and $679$ could be correctly classified.
import dalex as dx
import lime
import shap
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import random
import warnings
warnings.filterwarnings("ignore")
import plotly
plotly.offline.init_notebook_mode()
url = "https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/qsar-biodeg.csv"
df = pd.read_csv(url, index_col=0)
# change target labels 2->1 and 1->0
df["TARGET"] -= 1
display(df.head())
X = df.loc[:, df.columns != 'TARGET']
y = df["TARGET"]
| | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.919 | 2.6909 | 31.4 | 0.000 | 3.106 | 2.550 | 9.002 | 0.960 | 1.142 | 1.201 | 1.932 | 0.011 | 0.000 | 4.489 | 2.949 | 1.591 | 7.253 | 1 |
| 1 | 4.170 | 2.1144 | 30.8 | 0.000 | 2.461 | 1.393 | 8.723 | 0.989 | 1.144 | 1.104 | 2.214 | -0.204 | 0.000 | 1.542 | 3.315 | 1.967 | 7.257 | 1 |
| 2 | 3.932 | 3.2512 | 26.7 | 0.000 | 3.279 | 2.585 | 9.110 | 1.009 | 1.152 | 1.092 | 1.942 | -0.008 | 0.000 | 4.891 | 3.076 | 2.417 | 7.601 | 1 |
| 3 | 3.000 | 2.7098 | 20.0 | 0.000 | 2.100 | 0.918 | 6.594 | 1.108 | 1.167 | 1.024 | 1.414 | 1.073 | 8.361 | 1.333 | 3.046 | 5.000 | 6.690 | 1 |
| 4 | 4.236 | 3.3944 | 29.4 | -0.271 | 3.449 | 2.753 | 9.528 | 1.004 | 1.147 | 1.137 | 1.985 | -0.002 | 10.348 | 5.588 | 3.351 | 2.405 | 8.003 | 1 |
# Splitting data
SEED = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
scaler = StandardScaler().fit(X_train)
#scaling - transforming
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# convert back to dataframes
X_train = pd.DataFrame(X_train, columns = X.columns, index = y_train.index)
X_test = pd.DataFrame(X_test, columns = X.columns, index = y_test.index)
models = {
'GradientBoostingClassifier': GradientBoostingClassifier(random_state=SEED, n_estimators=100),
'LogisticRegression': LogisticRegression(random_state=SEED, penalty='l2', solver = 'lbfgs'),
}
explainers = {}
for name, model in models.items():
model.fit(X_train, y_train)
explainer = dx.Explainer(model, X_test, y_test)
# save explainer
explainers[name] = explainer
Preparation of a new explainer is initiated
  -> data : 211 rows 17 cols
  -> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable : 211 values
  -> model_class : sklearn.ensemble._gb.GradientBoostingClassifier (default)
  -> label : Not specified, model's class short name will be used. (default)
  -> predict function : <function yhat_proba_default at 0x0000021749659CA0> will be used (default)
  -> predict function : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values : min = 0.00988, mean = 0.358, max = 0.983
  -> model type : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals : min = -0.926, mean = -0.0782, max = 0.934
  -> model_info : package sklearn
A new explainer has been created!
Preparation of a new explainer is initiated
  -> data : 211 rows 17 cols
  -> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable : 211 values
  -> model_class : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label : Not specified, model's class short name will be used. (default)
  -> predict function : <function yhat_proba_default at 0x0000021749659CA0> will be used (default)
  -> predict function : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values : min = 9.77e-11, mean = 0.348, max = 0.98
  -> model type : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals : min = -0.968, mean = -0.0685, max = 0.994
  -> model_info : package sklearn
A new explainer has been created!
performance = pd.concat([explainer.model_performance().result for explainer in explainers.values()], axis=0)
performance
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.857820 | 0.925680 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |
np.random.seed(SEED)
# y_test_0 = y_test[y_test == 0]
# y_test_1 = y_test[y_test == 1]
# idx_0 = np.random.choice(y_test_0.index, 1)
# idx_1 = np.random.choice(y_test_1.index, 1)
# idx = np.concatenate((idx_1, idx_0))
idx = np.array([261, 679])
obs_var = X_test.loc[idx]
obs_target = y_test[idx].to_frame()
obs_df = obs_var.join(obs_target)
display(obs_df)
| | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261 | 0.570528 | -1.026712 | -0.199308 | 2.147546 | 0.090398 | 0.734119 | 0.567574 | -0.658983 | 0.114451 | -0.294956 | 0.088071 | -0.002628 | 0.152555 | 0.565992 | -0.498760 | -0.447733 | -0.311315 | 1 |
| 679 | 1.289697 | -0.922888 | 0.170087 | 2.225700 | -0.573715 | -0.644963 | 0.699339 | -1.060742 | -0.113295 | 0.506875 | 1.388732 | -0.033523 | -0.751856 | -0.876388 | -0.267852 | -1.326784 | -0.507680 | 0 |
pred_df = obs_target.copy()
for name, model in models.items():
pred_df[name] = model.predict(obs_var)
display(pred_df.T)
| | 261 | 679 |
|---|---|---|
| TARGET | 1 | 0 |
| GradientBoostingClassifier | 1 | 1 |
| LogisticRegression | 0 | 0 |
e.g. in Python: AIX360, Alibi, dalex, PDPbox; in R: pdp, DALEX, ALEplot.
*implement CP yourself for a potential bonus point
explainer = explainers['GradientBoostingClassifier']
np.random.seed(SEED)
# cp = [explainer.predict_profile(new_observation=obs_var.loc[[i]]) for i in obs_var.index]
cp_all = explainer.predict_profile(new_observation=obs_var)
cp_all.plot(variables = ["V1", "V12", "V15", "V27"], color='_ids_')
For example, model predictions are increasing with age for one observation and decreasing with age for another one. NOTE that you will need to have a model with interactions to observe such differences.
The observations $261, 679$ satisfy this condition. See above.
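For the bonus task of implementing CP by hand, the computation dalex's predict_profile performs can be sketched as follows: vary one feature over a grid while keeping the other features of the observation fixed, and record the model's predictions. The function name `ceteris_paribus` and the toy data below are ours, not part of dalex.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def ceteris_paribus(model, X, obs, variable, n_points=101):
    """CP profile: vary `variable` over its observed range while keeping
    the remaining features of the single-row DataFrame `obs` fixed."""
    grid = np.linspace(X[variable].min(), X[variable].max(), n_points)
    rows = pd.concat([obs] * n_points, ignore_index=True)
    rows[variable] = grid
    return pd.DataFrame({variable: grid,
                         "prediction": model.predict_proba(rows)[:, 1]})

# toy illustration on synthetic data
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y_toy = (X_toy["a"] + X_toy["b"] > 0).astype(int)
clf = LogisticRegression().fit(X_toy, y_toy)

profile = ceteris_paribus(clf, X_toy, X_toy.iloc[[0]], "a")
```

Since the toy target grows with `a`, the fitted logistic model's CP profile for `a` is monotonically increasing.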
*implement PDP yourself for a potential bonus point
explainer = explainers['GradientBoostingClassifier']
pdp = explainer.model_profile()
# pdp.plot(variables = ["V1", "V12", "V15", "V27"], geom="profiles", title="Partial Dependence Plot with individual profiles")
pdp.plot(variables = ["V1", "V12", "V15", "V27"], title="Partial Dependence Plot")
# pdp.plot(variables=["V18", "V17"], geom="profiles", title="Partial Dependence Plot with individual profiles")
pdp.plot(variables=["V18", "V17"], title="Partial Dependence Plot")
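For the bonus task of implementing PDP by hand: the PDP is the pointwise average of the CP profiles of all observations, i.e. for each grid value $v$ we set the variable to $v$ in every row and average the predicted probabilities. A minimal sketch (the function name `partial_dependence` and the toy data are ours):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def partial_dependence(model, X, variable, n_points=51):
    """PDP: for each grid value v, set `variable` to v in every row of X,
    predict, and average -- the mean of all CP profiles."""
    grid = np.linspace(X[variable].min(), X[variable].max(), n_points)
    avg = [model.predict_proba(X.assign(**{variable: v}))[:, 1].mean()
           for v in grid]
    return pd.DataFrame({variable: grid, "pd": avg})

# toy illustration on synthetic data
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y_toy = (X_toy["a"] - X_toy["c"] > 0).astype(int)
clf = LogisticRegression().fit(X_toy, y_toy)

pdp_a = partial_dependence(clf, X_toy, "a")
```

Each individual CP profile for `a` is increasing here, so their average, the PDP, is increasing as well.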
explainer = explainers['LogisticRegression']
pdp_lr = explainer.model_profile()
pdp_lr.plot(pdp, variables=["V18", "V17"], title="Partial Dependence Plot")
pdp_lr.plot(pdp, variables=["V1", "V12", "V15", "V27"], title="Partial Dependence Plot")
pdp_lr.plot(pdp, title="Partial Dependence Plot")